The multivariate analysis techniques of cluster
analysis, principal components analysis, and discriminant 
analysis are examined in this thesis. The theory and 
applications of each of the techniques are discussed. 
Computer software available at the Naval Postgraduate School 
is discussed and sample jobs are included. 

A hierarchical cluster analysis algorithm, available in 
the IMSL software package, is applied to a set of data 
extracted from a group of subjects for the purpose of 
partitioning a collection of 26 attributes of a weapon 
system into six clusters of superattributes. 

A nonhierarchical clustering procedure, principal 
components analysis, and discriminant analysis were all 
applied to a collection of data on tanks consisting of
twenty-four observations of ten attributes of tanks. The
cluster analysis shows that the tanks cluster somewhat 
naturally by nationality. The principal components analysis 
and the discriminant analysis show that tank weight is the 
single most important discriminator among nationalities.

I. DISCUSSION OF MULTIVARIATE DATA ANALYSIS



A. INTRODUCTION 

As a set of statistical techniques, multivariate data
analysis is concerned with data collected on several
dimensions of the same observations. The techniques can be
used for many purposes in the behavioral, mathematical, and
administrative sciences, ranging from rigidly controlled
experiments to explain relationships assumed to be present
in a large mass of data to attempts to cluster similar
elements or to find functions of the variables that will
best discriminate among preselected subpopulations of the
observations.

The heart of any multivariate analysis consists of the 
data matrix. This matrix is a table that gives the results 
of a number of observations on a number of variables 
simultaneously (Table I).

TABLE I. Illustrative Data Matrix

                               Variables
Observations     1       2       3      ...     j      ...     p
     1          x_11    x_12    x_13    ...    x_1j    ...    x_1p
     2          x_21    x_22    x_23    ...    x_2j    ...    x_2p
    ...
     i          x_i1    x_i2    x_i3    ...    x_ij    ...    x_ip
    ...
     n          x_n1    x_n2    x_n3    ...    x_nj    ...    x_np



The table consists of a set of observations (the n 
rows) and a set of measurements on those observations (the 
p columns). Cell entries represent the value $x_{ij}$ of
observation i on variable j. The values are
characteristics of the observations and serve to define the 
observations in any specific study. The cell values may 
consist of nominal, ordinal, interval, or ratio-scaled 
measurements, or various combinations of these across columns.

In a general sense, "multivariate" analysis concerns
two main features:

1. The multivariate character lies in the 
multiplicity of the p variables, not 
in the size of the set n . 

2. The variables are dependent among themselves,
so that we cannot split off one or more
from the others and consider it by itself.
The variables must be considered together.

There are three characteristics often used as a basis 
for the classification of multivariate analysis: 

1. whether one's principal focus is on the 
objects or on the variables of the data 
matrix; 

2. whether the data matrix is partitioned 
into criterion and independent subsets, 
and the number of variables in each; 

3. whether the cell values represent
nominal, ordinal, or interval scale
measurements.

This classification results in four major subdivisions of
interest:

1. single criterion, multiple predictor
association, including multiple regression,
analysis of variance and covariance, and
two-group discriminant analysis;

2. multiple criterion, multiple predictor
association, including canonical correlation,
multivariate analysis of variance
and covariance, and multiple discriminant
analysis;

3. analysis of variable interdependence, 
including factor analysis, multidimen- 
sional scaling, and other types of 
dimension-reducing methods; 

4. analysis of interobject similarity, 
including cluster analysis and other 
types of grouping procedures. 

The first two categories involve dependence structures 
where the data matrix is partitioned into criterion and 
independent subsets; in both cases interest is focused on 
the variables. The last two categories are concerned with 
interdependence - either focusing on variables or on 
observations. Within each of the four categories, the various
techniques are differentiated in terms of the type of scale
assumed.

In this research, we consider only the following 
techniques of multivariate analysis: 

1. Principal components analysis 

2. Discriminant analysis 

3. Cluster analysis

II. PRINCIPAL COMPONENTS ANALYSIS



The basic idea of principal components analysis is to 
describe the dispersion of an array of n points in p- 
dimensional space by introducing a new set of orthogonal 
linear coordinates so that the sample variances of the 
given data points with respect to these derived coordinates 
are in decreasing order of magnitude. Thus the first
principal component is such that the projections of the given
points onto it have maximum variance among all possible
linear coordinates; the second principal component has
maximum variance subject to being orthogonal to the first;
and so on.

Suppose that the random variables $X_1, X_2, \ldots, X_p$
of interest have a certain multivariate distribution with
finite mean vector $\mu$ and variance-covariance matrix
$\Sigma$.

From this population a sample of n independent
observation vectors has been drawn. The observations can
be written as the usual n x p data matrix

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \qquad (1)$$

The estimate of $\Sigma$ will be the usual sample variance-
covariance matrix S, defined as follows:

$$S = \frac{1}{n - 1} A, \qquad A = \sum_{j=1}^{n} (X_j - \bar{X})(X_j - \bar{X})' \qquad (2)$$

where $X_j$ is the j-th observation vector and $\bar{X}$ is the
sample mean vector.



The information we shall need for our principal components
analysis will be contained in S. However, it will
be necessary to make a choice of measures of dependence:
should we work with the variances and covariances of the
observations, and carry out the analysis in the original units
of the responses, or would a more accurate picture of the
dependence pattern be obtained if each $x_{ij}$ were
transformed to a standardized score

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

and the correlation matrix R employed? The components
obtained from S and R are in general not the same, nor is
it possible to pass from one solution to the other by a
simple scaling of the coefficients.

If the responses are in widely different units (e.g.,
number of crew, weight in tons, speed in kilometers per
hour) with large differences in their magnitudes,
linear compounds of the original quantities would have little
meaning, and the standardized variates and correlation matrix
should be employed. Conversely, if the responses are
reasonably commensurable, the covariance form has a greater
statistical appeal, for the i-th principal component is
that linear compound of the responses which explains the
i-th largest portion of the total response variance, and
maximization of such total variance of standard scores is
rather artificial.
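
The effect of this choice can be seen on a small numerical example. The sketch below assumes NumPy is available and uses made-up values for three attributes in very different units (hypothetical data, not the tank data analyzed in this thesis). It computes S and R and the leading eigenvector of each.

```python
import numpy as np

# Hypothetical data: rows are observations; columns are
# crew size, weight (tons), and speed (km/h).
X = np.array([
    [4.0, 36.0, 50.0],
    [4.0, 52.0, 48.0],
    [3.0, 41.0, 65.0],
    [4.0, 60.0, 42.0],
    [3.0, 38.0, 60.0],
])

S = np.cov(X, rowvar=False)        # sample covariance matrix (divisor n - 1)
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix


def leading_eigenvector(M):
    """Unit eigenvector of a symmetric matrix for its largest eigenvalue."""
    values, vectors = np.linalg.eigh(M)   # eigenvalues in ascending order
    v = vectors[:, -1]
    # Fix the arbitrary sign so that the largest coefficient is positive.
    return v if v[np.argmax(np.abs(v))] > 0 else -v


a_S = leading_eigenvector(S)
a_R = leading_eigenvector(R)
print("first component from S:", np.round(a_S, 3))
print("first component from R:", np.round(a_R, 3))
```

With data of this kind the component from S is dominated by the variables with the largest variances, while the component from R gives the low-variance variable (crew size) a much larger coefficient.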

The first principal component of the complex of sample
values of the responses $X_1, X_2, \ldots, X_p$ is the
linear compound

$$Y_1 = a_{11} X_1 + a_{21} X_2 + \cdots + a_{p1} X_p \qquad (3)$$

whose coefficients $a_{i1}$ are the elements of the eigenvector
associated with the greatest eigenvalue $\lambda_1$ of the sample
variance-covariance matrix of the responses. The $a_{i1}$
are unique up to multiplication by a scale factor, and if
they are scaled so that $A_1' A_1 = 1$, the eigenvalue is
interpretable as the sample variance of $Y_1$.

Numerically, finding the first principal component means
finding the vector $A_1$ such that

$$Y_1 = a_{11} X_1 + \cdots + a_{p1} X_p = A_1' X \qquad (4)$$

maximizes the sample variance

$$s_{Y_1}^2 = \sum_{i=1}^{p} \sum_{j=1}^{p} a_{i1} a_{j1} s_{ij} = A_1' S A_1 \qquad (5)$$

over all coefficient vectors normalized so that $A_1' A_1 = 1$.

To determine the coefficients, the normalization constraint
is introduced by means of a Lagrange multiplier $\lambda_1$, and the
resulting expression is differentiated with respect to $A_1$:

$$\frac{\partial}{\partial A_1} \left[ A_1' S A_1 + \lambda_1 (1 - A_1' A_1) \right] = 2 (S - \lambda_1 I) A_1 \qquad (6)$$

The coefficients must satisfy the p simultaneous linear
equations

$$(S - \lambda_1 I) A_1 = 0 \qquad (7)$$

If the solution to these equations is to be other than the
null vector, the value of $\lambda_1$ must be chosen so that

$$|S - \lambda_1 I| = 0 \qquad (8)$$

$\lambda_1$ is thus an eigenvalue of the variance-covariance matrix,
and $A_1$ is its associated eigenvector. To determine which
of the p eigenvalues should be used, premultiply the
system of equations (7) by $A_1'$. Since $A_1' A_1 = 1$, it
follows that

$$\lambda_1 = A_1' S A_1 = s_{Y_1}^2$$

But the coefficient vector was chosen to maximize this
variance, and therefore $\lambda_1$ must be the greatest
eigenvalue of S.
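
The result can be checked numerically. The following sketch assumes NumPy and a randomly generated sample (hypothetical data). It verifies that the unit eigenvector for the largest eigenvalue of S satisfies equation (7), that the variance of the resulting scores equals that eigenvalue, and that no randomly chosen unit vector gives a larger variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample: 200 observations on 3 correlated variables.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
S = np.cov(X, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(S)   # ascending order
lam1 = eigenvalues[-1]                          # greatest eigenvalue
A1 = eigenvectors[:, -1]                        # unit length, so A1'A1 = 1

Y1 = (X - X.mean(axis=0)) @ A1                  # scores on the first component
var_Y1 = Y1.var(ddof=1)

# Variance a'Sa for 1000 random unit vectors a; none should exceed lam1.
trial = rng.normal(size=(1000, 3))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
trial_vars = np.einsum("ij,jk,ik->i", trial, S, trial)

print("largest eigenvalue      :", lam1)
print("variance of Y1          :", var_Y1)
print("best random unit vector :", trial_vars.max())
```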

The second principal component is that linear compound

$$Y_2 = a_{12} X_1 + \cdots + a_{p2} X_p \qquad (9)$$

whose coefficients have been chosen, subject to the
constraints

$$A_2' A_2 = 1, \qquad A_1' A_2 = 0 \qquad (10)$$

so that the variance of $Y_2$, $A_2' S A_2$, is a maximum.
The first constraint is merely a scaling to assure the
uniqueness of the coefficients, while the second requires
that $A_1$ and $A_2$ be orthogonal.

The coefficients of the second component can also be
found by the Lagrangian technique with two multipliers $\lambda_2$
and $\mu$. Differentiating the Lagrangian with respect to $A_2$
gives

$$\frac{\partial}{\partial A_2} \left[ A_2' S A_2 + \lambda_2 (1 - A_2' A_2) + \mu A_1' A_2 \right] = 2 (S - \lambda_2 I) A_2 + \mu A_1 \qquad (11)$$

If the right-hand side is set equal to 0 and premultiplied
by $A_1'$, it follows from the normalization and orthogonality
conditions that

$$2 A_1' S A_2 + \mu = 0 \qquad (12)$$

Similar premultiplication of equation (7) by $A_2'$
implies that

$$A_2' S A_1 = 0 \qquad (13)$$

and hence, since S is symmetric, $\mu = 0$. The second vector
must satisfy

$$(S - \lambda_2 I) A_2 = 0 \qquad (14)$$

and it follows that the coefficients of the second component
are the elements of the eigenvector corresponding to
the second greatest eigenvalue. The remaining principal
components are found in their turn in the same manner from
the other eigenvectors.

Thus the j-th principal component of the sample of
p-variate observations is the linear compound

$$Y_j = a_{1j} X_1 + \cdots + a_{pj} X_p \qquad (15)$$

whose coefficients are the elements of the eigenvector of
the sample variance-covariance matrix S corresponding to
the j-th largest eigenvalue $\lambda_j$. If $\lambda_i \ne \lambda_j$, the
coefficients of the i-th and j-th components are
necessarily orthogonal; if $\lambda_i = \lambda_j$, the elements can be
chosen to be orthogonal, although an infinity of such
orthogonal vectors exists. The sample variance of the
j-th component is $\lambda_j$, and the total system variance is
thus

$$\lambda_1 + \lambda_2 + \cdots + \lambda_p = \operatorname{tr} S \qquad (16)$$

The importance of the j-th component in a more parsimonious
description of the system is measured by

$$\frac{\lambda_j}{\operatorname{tr} S} \qquad (17)$$

which gives the fraction of the total variance contributed
by the j-th component.
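
Equations (15) through (17) can be put together in a few lines. The sketch below is an illustration only: it assumes NumPy, the function name `principal_components` is hypothetical, and the data matrix is made up.

```python
import numpy as np


def principal_components(X):
    """Eigenvalues (largest first), eigenvectors (as columns), and the
    fraction of total variance of each component of the sample covariance."""
    S = np.cov(X, rowvar=False)
    values, vectors = np.linalg.eigh(S)       # ascending order
    order = np.argsort(values)[::-1]          # largest eigenvalue first
    values, vectors = values[order], vectors[:, order]
    fractions = values / np.trace(S)          # equation (17)
    return values, vectors, fractions


# Hypothetical 6 x 3 data matrix.
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.9],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.8],
              [2.3, 2.7, 1.3]])

values, vectors, fractions = principal_components(X)
scores = (X - X.mean(axis=0)) @ vectors       # Y_1, ..., Y_p for each observation

print("eigenvalues          :", np.round(values, 4))
print("fraction of variance :", np.round(fractions, 4))
```

The covariance matrix of `scores` is diagonal with the eigenvalues on its diagonal, which is the numerical counterpart of the statement that the components are uncorrelated and that their variances sum to tr S.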
III. DISCRIMINANT ANALYSIS



A. INTRODUCTION 

The basic idea of discriminant analysis consists of
assigning an individual from a group of individuals to one
of several known or unknown distinct populations, on the
basis of observations on several characters of the
individual or group and a sample of observations on these
characters from the populations if these are unknown.

Fisher (1936) was the first to suggest a linear 
function of variables representing different characters, 
hereafter called the linear discriminant function (discrimi- 
nator) for classifying an individual into one of two popu- 
lations. Later research extended the analysis to classifica- 
tion into one of k populations. 

For the univariate case Fisher suggested a rule which
classifies an observation x into the i-th univariate
population if

$$|x - \bar{x}_i| = \min(|x - \bar{x}_1|, |x - \bar{x}_2|), \qquad i = 1, 2 \qquad (18)$$

where $\bar{x}_i$ is the sample mean based on a sample of size $N_i$
from the i-th population. For two p-variate populations
$\pi_1$ and $\pi_2$ (with the same covariance matrix) Fisher
replaced the vector random variable by an optimum linear
combination of its components, obtained by maximizing the
ratio of the difference of the expected values of a linear
combination under $\pi_1$ and $\pi_2$ to its standard deviation.
He then used his univariate discrimination method with this
optimum linear combination of components as the random
variable.
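
The univariate rule (18) amounts to assigning x to the population whose sample mean is nearer. A minimal sketch, using only the Python standard library, is given below; the function name and the sample values are hypothetical, and ties are arbitrarily resolved in favor of the first population.

```python
from statistics import mean


def classify_univariate(x, sample1, sample2):
    """Assign x to population 1 or 2, whichever sample mean is nearer
    (rule (18)); a tie goes to population 1 in this sketch."""
    d1 = abs(x - mean(sample1))
    d2 = abs(x - mean(sample2))
    return 1 if d1 <= d2 else 2


# Hypothetical samples from two univariate populations.
pop1 = [36.0, 38.0, 41.0, 40.0]   # sample mean 38.75
pop2 = [52.0, 55.0, 60.0, 57.0]   # sample mean 56.0

for x in (43.0, 50.0):
    print(x, "-> population", classify_univariate(x, pop1, pop2))
```

In the two-population multivariate case the same function could be applied to the values of the optimum linear combination rather than to a single measured variable.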

Rao (1948) considered the problem of classifying
people into one of three populations (castes of India). He
assumed that each of the three populations could be
characterized by four variables - stature ($x_1$), sitting
height ($x_2$), nasal depth ($x_3$), and nasal height ($x_4$) - of
each member of the population. On the basis of sample
observations on these characters from the three populations,
the problem is to classify an individual with observation
$x = (x_1, x_2, x_3, x_4)$ into one of the three populations. He
used a linear discriminator to obtain the solution.

B. THEORY 

In general, the underlying assumptions of discriminant 
analysis are: 

1. the groups being investigated are discrete 
and identifiable; 

2. each observation in each group can be 
described by a set of measurements on p 
characteristics or variables; 

3. these p variables are assumed to have
a multivariate normal distribution in each
population.

The purposes of discriminant analysis are:

1. to test for mean group differences and to 
describe the overlaps among groups; 

2. to construct classification schemes based 
upon the set of p variables in order to 
assign previously unclassified observations 
to the appropriate groups. 

Hence, the problem of studying the direction of group
differences is, equivalently, a problem of finding a linear
combination of the original independent variables that
shows large differences in group means. In short,
discriminant analysis is a method for determining such linear
combinations.

The first step toward determining a linear combination
of a set of variables such that several group means on this
linear combination will differ widely among themselves is
to decide on a criterion for measuring such group-mean
differences. Once a linear combination has been constructed,
there is just a single transformed variable.
Hence, the F-ratio for testing the significance of the
overall difference among several group means on a single
variable suggests an appropriate criterion.




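$$\lambda = \frac{v' B v}{v' W v} \qquad (19)$$

where

$v' = (v_1, v_2, \ldots, v_p)$ is a set of weights which
maximizes $\lambda$.

$B = \sum_{i=1}^{G} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})'$ is the between-groups
matrix of sums of squares and cross products.

$W = \sum_{i=1}^{G} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'$ is the within-groups
matrix of sums of squares and cross products.

$x_{ij}$ is the j-th observation vector in the i-th
group.

$\bar{x}_i$ is the mean vector of the i-th group.

$\bar{x}$ is the grand mean vector of the data.

G is the number of groups.

$n_i$ is the number of observations in the i-th group.

Prime notation indicates transpose.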

This ratio $\lambda$, called the discriminant criterion, was
originally proposed by Fisher in connection with his two-
group discriminant function. Once a criterion for group
differentiation has been determined, a set of weights,
$v' = (v_1, v_2, \ldots, v_p)$, which maximizes this criterion
should be determined. This is accomplished by taking the
partial derivative of $\lambda$ with respect to each component
$v_i$ of v and setting the result equal to zero:

$$\frac{\partial \lambda}{\partial v} = \frac{2 \left[ (Bv)(v' W v) - (v' B v)(Wv) \right]}{(v' W v)^2} = \frac{2 (Bv - \lambda Wv)}{v' W v} = 0 \qquad (20)$$

which is equivalent to

$$(B - \lambda W) v = 0, \qquad (W^{-1} B - \lambda I) v = 0 \qquad (21)$$

This equation is of the form

$$(A - \lambda I) v = 0 \qquad (22)$$

Its solution, yielding the eigenvalues $\lambda_m$ and associated
eigenvectors $v_m$ of the matrix A, is therefore obtained in
the same way as in principal components analysis, and the
solution of this eigenvalue problem also solves the problem
of maximizing the discriminant criterion.

In the last equation, the number of nonzero eigenvalues
of a square matrix A is equal to the rank of A. With
$W^{-1}B$ playing the role of A, the number of nonzero
eigenvalues depends on the rank of B, since the rank of the
product of two matrices cannot exceed the smaller of the two
factor matrices' ranks, and $W^{-1}$ (being nonsingular) must be
of full rank p, while the rank of B is usually smaller
than p. Thus the rank of B can be denoted by
r = min(G - 1, p).
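
The eigenvalue problem (21) and the rank argument can be illustrated numerically. The sketch below assumes NumPy, uses simulated groups (hypothetical data), and uses hypothetical function names. As a numerical convenience that the text does not prescribe, it factors W = LL' (Cholesky) and solves the equivalent symmetric problem for $L^{-1} B L'^{-1}$, which has the same eigenvalues as $W^{-1}B$.

```python
import numpy as np


def sscp_matrices(groups):
    """Within-groups (W) and between-groups (B) SSCP matrices.
    `groups` is a list of (n_i x p) arrays, one per group."""
    grand_mean = np.vstack(groups).mean(axis=0)
    p = grand_mean.size
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in groups:
        m = g.mean(axis=0)
        centered = g - m
        W += centered.T @ centered
        d = (m - grand_mean)[:, None]
        B += g.shape[0] * (d @ d.T)
    return W, B


def discriminant_eigen(W, B):
    """Solve (B - lambda W) v = 0; eigenvalues are returned largest first."""
    L = np.linalg.cholesky(W)                 # W = L L'
    L_inv = np.linalg.inv(L)
    M = L_inv @ B @ L_inv.T                   # symmetric, same eigenvalues as W^{-1}B
    values, U = np.linalg.eigh(M)
    order = np.argsort(values)[::-1]
    return values[order], (L_inv.T @ U)[:, order]   # back-transform v = L'^{-1} u


rng = np.random.default_rng(1)
# Hypothetical data: G = 3 groups, p = 4 variables, 15 observations each.
centers = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [0, 2, 2, 0]], dtype=float)
groups = [c + rng.normal(size=(15, 4)) for c in centers]

W, B = sscp_matrices(groups)
lam, V = discriminant_eigen(W, B)
r = int(np.sum(lam > 1e-8 * lam[0]))          # numerically nonzero eigenvalues
print("eigenvalues:", np.round(lam, 6))
print("number of nonzero eigenvalues r =", r)
```

With three groups and four variables, only G - 1 = 2 eigenvalues are nonzero up to rounding error, in agreement with r = min(G - 1, p).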

From the fact that the eigenvalues $\lambda_m$ are the values
assumed by the discriminant criterion for linear combinations
using the elements of the corresponding eigenvectors $v_m$
as combining weights, it is clear that the eigenvector
$v_1' = (v_{11}, v_{12}, \ldots, v_{1p})$ provides a set of weights such
that the transformed variable

$$Y_1 = v_{11} X_1 + v_{12} X_2 + \cdots + v_{1p} X_p \qquad (23)$$

has the largest discriminant criterion, $\lambda_1$, achievable by
any linear combination of the p independent variables.

What are the properties of the remaining eigenvectors,
$v_2, v_3, \ldots, v_r$? The second discriminant function
$Y_2 = v_{21} X_1 + v_{22} X_2 + \cdots + v_{2p} X_p$, whose weights are the
elements of the eigenvector $v_2$ associated with the second
largest eigenvalue $\lambda_2$ of $W^{-1}B$, has the largest
discriminant criterion among those linear combinations of
the $X_i$ that are uncorrelated with the first discriminant
function in the total sample. The proof is
analogous to that for principal components analysis. Each
discriminant function has a relative (or conditional)
maximum value for its discriminant criterion. Therefore, it
needs only to be shown that $Y_2$ is uncorrelated with $Y_1$.

Noting that this correlation is proportional to $v_1' T v_2$
(where T = W + B), we have to prove that $v_1' T v_2 = 0$.
Since

$$(B - \lambda_i W) v_i = 0 \quad \text{for each } i \qquad (24)$$

we have

$$B v_1 = \lambda_1 W v_1 \quad \text{and} \quad B v_2 = \lambda_2 W v_2$$

Premultiplying these equations by $v_2'$ and $v_1'$, respectively,
gives

$$v_2' B v_1 = \lambda_1 v_2' W v_1, \qquad v_1' B v_2 = \lambda_2 v_1' W v_2 \qquad (25)$$

Taking the transpose of both sides of the first equation (B
and W are symmetric),

$$v_1' B v_2 = \lambda_1 v_1' W v_2$$

thus

$$\lambda_1 v_1' W v_2 = \lambda_2 v_1' W v_2, \qquad (\lambda_1 - \lambda_2) v_1' W v_2 = 0$$

and since $\lambda_1 \ne \lambda_2$, $v_1' W v_2 = 0$. By (25) it then follows
that $v_1' B v_2 = 0$ as well, so that
$v_1' T v_2 = v_1' W v_2 + v_1' B v_2 = 0$, which means $Y_1$ and $Y_2$ are
uncorrelated. $Y_2$ therefore has this property: its
discriminant-criterion value, $\lambda_2$, is the largest achievable by
any linear combination of the X's that is uncorrelated (in the
total sample) with $Y_1$. Similarly,



$$Y_3 = v_{31} X_1 + v_{32} X_2 + \cdots + v_{3p} X_p \qquad (26)$$

has the largest possible discriminant-criterion value ($\lambda_3$)
among all linear combinations of the X's that are uncorrelated
with both $Y_1$ and $Y_2$; and so on until $Y_r$, using the
elements of $v_r$ as weights, has the largest possible
discriminant-criterion value among linear combinations that
are uncorrelated with all the preceding linear combinations
$Y_1, Y_2, \ldots, Y_{r-1}$. The linear combinations
$Y_1, Y_2, \ldots, Y_r$ are called the first, second, ..., r-th (linear)
discriminant functions for optimally differentiating among
the G given groups.

The situation here is reminiscent of principal components
analysis. There, the dimension corresponding to the
first component had maximum variance; the second-component
dimension had maximum variance among those uncorrelated with
the first; and so on. In discriminant analysis, the ratio
of between- to within-groups sums of squares merely takes the
place of variance as the criterion in determining the
successive dimensions. However, an important difference
between the dimensions identified in discriminant analysis
and those in component analysis is that the former are
generally not mutually orthogonal in the test space, even
though they are uncorrelated. That is, the axes representing
the discriminant functions are not a subset of axes obtainable
by rigid rotation of the original system of p axes; the
discriminant rotation is an oblique rotation.
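
The distinction between "uncorrelated" and "orthogonal" can be seen on a two-variable example. The sketch below assumes NumPy and uses small made-up W and B matrices (hypothetical values, chosen only to be symmetric and positive definite).

```python
import numpy as np

# Hypothetical within-groups and between-groups SSCP matrices (p = 2).
W = np.array([[4.0, 2.0],
              [2.0, 3.0]])
B = np.array([[6.0, 1.0],
              [1.0, 2.0]])
T = W + B

# Eigenvectors of W^{-1}B; np.linalg.solve(W, B) forms W^{-1}B without
# explicitly inverting W.
values, vectors = np.linalg.eig(np.linalg.solve(W, B))
order = np.argsort(values)[::-1]
v1, v2 = vectors[:, order[0]], vectors[:, order[1]]   # unit-length columns

print("v1' W v2 =", float(v1 @ W @ v2))   # zero up to rounding
print("v1' T v2 =", float(v1 @ T @ v2))   # zero up to rounding: uncorrelated
print("v1' v2   =", float(v1 @ v2))       # not zero: axes are oblique
```

The first two quantities vanish, so the two discriminant functions are uncorrelated, but the ordinary inner product of the weight vectors does not, which is what is meant by an oblique rotation.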

Just as in the principal components analysis, the
dimensions represented by the discriminant functions may be
interpreted meaningfully. Even if they are not, it may be
possible to achieve parsimony by reducing the dimensionality
of the space needed to describe group differences. In
seeking to interpret the discriminant functions, the goal is
to determine which of the original p variables contribute
most to each function. For this purpose, comparison of the
relative magnitudes of the combining weights as given by the
elements of each eigenvector of $W^{-1}B$ is inappropriate
because these are weights to be applied to the variables in
raw-score scales, and are hence affected by the particular
unit used for each variable.

To eliminate the spurious effects of units of measurement
on the magnitudes of combining weights, standardized
variables should be used.

The relative magnitudes of these standardized weights may
be assessed by multiplying each raw-score weight by the
standard deviation of the corresponding variable as computed
from the within-groups SSCP (sums of squares and cross products)
matrix. This amounts to multiplying each element of a given
eigenvector by the square root of the corresponding
diagonal element of W. Thus, for each m, define

$$v_{mi}^* = \sqrt{w_{ii}} \, v_{mi}, \qquad i = 1, 2, \ldots, p \qquad (27)$$

as the standardized discriminant weights. The relative
contribution of the i-th variable to the m-th discriminant
function may then be gauged by the magnitude of $v_{mi}^*$ in
comparison with the other weights $v_{mj}^*$.

Up to this point, it has been shown that the dimensionality
of the discriminant space is equal to the number of
nonzero eigenvalues of $W^{-1}B$, which is the smaller of the
two numbers G - 1 and p. It may often happen that the
number of significant discriminant dimensions is even
smaller. That is, not all of the discriminant functions may
represent dimensions along which statistically significant
group differences occur.

C. SIGNIFICANCE TEST IN DISCRIMINANT ANALYSIS 

A basic quantity in testing the significance of the
overall difference among several group centroids (mean
vectors) is the ratio of the determinants of the within-
groups and the total SSCP matrices, known as Wilks' $\Lambda$
criterion:

$$\Lambda = \frac{|W|}{|T|} \qquad (28)$$

Motivation for use of this equation may be seen as follows:

$$\frac{1}{\Lambda} = \frac{|T|}{|W|} = |W^{-1} T| = |W^{-1}(W + B)| = |I + W^{-1}B| = (1 + \lambda_1)(1 + \lambda_2) \cdots (1 + \lambda_r) \qquad (29)$$

where $\lambda_1, \lambda_2, \ldots, \lambda_r$ are the nonzero eigenvalues of $W^{-1}B$.
Consequently, Bartlett's V statistic for testing the
significance of an observed $\Lambda$ value can be expressed as

$$V = -[N - 1 - (p + G)/2] \ln \Lambda = [N - 1 - (p + G)/2] \ln[(1 + \lambda_1)(1 + \lambda_2) \cdots (1 + \lambda_r)] = [N - 1 - (p + G)/2] \sum_{m=1}^{r} \ln(1 + \lambda_m) \qquad (30)$$

where N is the total number of observations. This statistic is
distributed approximately as chi-square with
p(G - 1) degrees of freedom.

Because of the uncorrelatedness of the successive
discriminant functions, the successive terms $\ln(1 + \lambda_m)$
in the last expression above are statistically independent
(assuming multivariate normality of the original p
variables). As a result, the additive components of V are
each approximately distributed as a chi-square variate.
More specifically, the m-th component,

$$V_m = [N - 1 - (p + G)/2] \ln(1 + \lambda_m) \qquad (31)$$

is approximately chi-square with p + G - 2m degrees of
freedom. It may be readily verified that the sum of the
numbers of degrees of freedom (n.d.f.) of the r components,
that is, (p + G - 2) + (p + G - 4) + ... + (p + G - 2r),
is equal to p(G - 1) regardless of whether r = G - 1 or
r = p.
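
The degrees-of-freedom identity follows from summing the arithmetic series, which gives r(p + G - r - 1); substituting either r = G - 1 or r = p yields p(G - 1). A short check in Python (standard library only; the function name is hypothetical) confirms this for a range of p and G.

```python
def component_dof(p, G):
    """Degrees of freedom p + G - 2m of the components V_m, m = 1, ..., r,
    where r = min(G - 1, p)."""
    r = min(G - 1, p)
    return [p + G - 2 * m for m in range(1, r + 1)]


for p, G in [(10, 3), (2, 6), (4, 5)]:
    dof = component_dof(p, G)
    print(f"p={p}, G={G}: components {dof}, sum {sum(dof)}, p(G-1) {p * (G - 1)}")
```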

Consequently, when we cumulatively subtract $V_1$, $V_2$,
and so on from V, the remainder each time is also a
chi-square variate; and these successive remainders become
appropriate statistics for testing whether the residual
discrimination after removing the first discriminant
function, the first and second discriminant functions, and
so forth, is statistically significant. The successive
test statistics and their n.d.f.'s may be summarized as
follows:

Residual After            Approximate
Removing                  Chi-square Statistic         n.d.f.

First discriminant        V - V_1                      p(G-1) - (p+G-2)
function                                               = (p-1)(G-2)

First 2 discriminant      V - V_1 - V_2                (p-1)(G-2) - (p+G-4)
functions                                              = (p-2)(G-3)

First 3 discriminant      V - V_1 - V_2 - V_3          (p-2)(G-3) - (p+G-6)
functions                                              = (p-3)(G-4)

...                       ...                          ...

First s discriminant      V - V_1 - V_2 - ... - V_s    (p-s)(G-(s+1))
functions

As soon as the residual after removing the first s
discriminant functions becomes smaller than the prescribed
percentile point (that is, the 100(1 - $\alpha$)th percentile) of
the appropriate chi-square distribution, we may conclude
that only the first s discriminant functions are
significant at that $\alpha$ level. If the number of significant
discriminant functions thus found is smaller than r
(as will often be the case), we will have effected a further
reduction in the dimensionality of the space required to
describe the differences among the G groups from which
our sample groups were drawn. The remaining r - s dimensions
may be regarded as immaterial for population differentiation,
since our sample differences along these dimensions can be
attributed to sampling error.
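
The sequence of residual tests can be sketched as follows. The code uses only the Python standard library, the function name and the eigenvalues are hypothetical, and the comparison with chi-square percentile points is left to tables or to a statistics package.

```python
import math


def residual_tests(eigenvalues, N, p, G):
    """Bartlett's V and the residual statistics after removing the first
    s discriminant functions, for s = 0, 1, ..., r - 1.

    Returns a list of (s, statistic, degrees_of_freedom) tuples; s = 0 is
    the overall test based on V itself."""
    factor = N - 1 - (p + G) / 2
    components = [factor * math.log(1 + lam) for lam in eigenvalues]  # V_m, eq. (31)
    results = []
    for s in range(len(components)):
        statistic = sum(components[s:])        # V - V_1 - ... - V_s
        dof = (p - s) * (G - s - 1)
        results.append((s, statistic, dof))
    return results


# Hypothetical nonzero eigenvalues of W^{-1}B for N = 50, p = 4, G = 3.
results = residual_tests([1.8, 0.1], N=50, p=4, G=3)
for s, stat, dof in results:
    print(f"after removing {s} function(s): statistic = {stat:.2f}, n.d.f. = {dof}")
```

Each statistic would then be compared with the 100(1 - $\alpha$)th percentile of the chi-square distribution with the listed degrees of freedom; the first s for which the residual falls below that point gives the number of significant discriminant functions.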
IV. CLUSTER ANALYSIS



A. ORIGIN AND THEORY 

Clustering is the grouping of similar objects. The 
principal functions of clustering are to name, to display, 
to summarize, to predict, and to aid in interpretation of 
data with many dimensions. Clustering techniques were
first developed in the field of biological taxonomy.
Clustering is one of several methodologies included in the
broader category called classification.

The cluster analysis problem is the last step we
consider in the progression of category sorting problems.
In discriminant analysis some part of the structure
is known and missing information is estimated from labeled
samples; the operational objective there is to
classify new observations, that is, to recognize them as
members of one category or another. In cluster analysis little or
nothing is known about the category structure. All that is
available is a collection of observations whose category
memberships are unknown. We seek to discover a category
structure which fits the observations. The problem may be
stated as one of finding the "natural groups", which means
sorting the observations into groups such that the degree of
"natural association" is high among members of the same
group and low between members of different groups.

Cluster analysis techniques have been applied in many
fields of study. The literature is both voluminous and
diverse, the terminology differing from one field to
another. "Numerical taxonomy" is frequently substituted
for cluster analysis among biologists, botanists, and
ecologists, while some social scientists may refer to
"typology". Other frequently encountered terms are pattern
recognition and partitioning. While discriminant analysis has
been studied by statisticians for nearly 45 years, cluster
analysis has only recently come to statistical notice. Any
method which partitions a set of objects into subsets on
the basis of measurements taken on every object qualifies
as a clustering method.

Most of the well known clustering techniques fall into
one of two main categories: (1) hierarchical and (2)
nonhierarchical (partitioning). The former is one in which
every cluster obtained at any stage is a merger of clusters
at previous stages. The nonhierarchical procedures, however,
form new clusters by lumping and splitting old ones. We
consider both categories shortly.

In a geometric sense, every observation may be viewed 
as a point in p-dimensional Euclidean space. This swarm of 
data points may contain dense regions or "clouds" of data 
points which are separable from other regions containing a 
low density of points. These denser regions constitute what 
are known as clusters. In one- and two-dimensional cases, it
is easy to visualize and to detect the clusters from scatter
plots, assuming that the clusters exist. In higher
dimensions, clustering becomes extremely difficult without
the aid of a computer.

Mathematical clustering techniques usually require a 
measure of similarity to be defined for every pairwise 
combination of the entities to be clustered. In order to 
solve the cluster problem, it is desirable to define the 
terms "similarity" and "difference" in a quantitative 
fashion. A researcher would assign two observations to 
the same group if the distance between them is sufficiently 
small, or to different clusters if this distance is 
sufficiently large. 

At this point, two questions arise. The first is "How do
we measure the distance between the observations?" and the
second is "How small is small enough, and how large is
large enough?" These will be discussed in the following
sections.

B. MEASURES OF DISTANCE 
1. General

Let $E_p$ denote a p-dimensional measurement space and
let X, Y, and Z be any points in $E_p$. Then any
nonnegative real-valued function D(X,Y) satisfying the
following conditions qualifies as a distance function (or
metric).

1. $D(X,Y) = 0$ if and only if $X = Y$

2. $D(X,Y) \ge 0$ for all X and Y in $E_p$

3. $D(X,Y) = D(Y,X)$

4. $D(X,Y) \le D(X,Z) + D(Y,Z)$
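
The four conditions can be tested mechanically on any finite set of points. The sketch below (standard library only, hypothetical function names and points) does so for the ordinary Euclidean distance and for its square. A check of this kind can refute the metric property but cannot prove it in general.

```python
import itertools
import math


def is_metric_on(points, D, tol=1e-12):
    """Check the four distance-function conditions on a finite set of points."""
    for x, y in itertools.product(points, repeat=2):
        d = D(x, y)
        if d < 0:                                   # condition 2: nonnegativity
            return False
        if (d <= tol) != (x == y):                  # condition 1: zero iff X = Y
            return False
        if abs(d - D(y, x)) > tol:                  # condition 3: symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if D(x, y) > D(x, z) + D(y, z) + tol:       # condition 4: triangle inequality
            return False
    return True


def euclidean(x, y):
    return math.dist(x, y)


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical points in two dimensions; the first three are collinear.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print("Euclidean distance passes        :", is_metric_on(points, euclidean))
print("squared Euclidean distance passes:", is_metric_on(points, squared_euclidean))
```

The squared Euclidean distance fails condition 4 on the collinear points (4 > 1 + 1), which is worth keeping in mind when squared distances are used as dissimilarities.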



Many clustering algorithms assume such distances given and
set about constructing clusters of objects within which the
distances are small. The choice of distance function is no
less important than the choice of variables to be used in
the study. A serious difficulty in choosing a distance lies
in the fact that a clustering structure is more primitive
than a distance function and that knowledge of clusters
changes the choice of distance function. Thus a variable
that distinguishes well between two established clusters
should be given more weight in computing distances than a
"junk" variable that distinguishes badly.

2. Euclidean Distance

The Euclidean distance between the I-th and K-th
observations of a data matrix X is defined as

$$D(I,K) = \Bigl[ \sum_{1 \le J \le P} \{X(I,J) - X(K,J)\}^2 \Bigr]^{1/2} \qquad (32)$$

where J indexes the variables. In one-, two-, or three-
dimensional space, this is just the "straight line" distance
between the vectors corresponding to the I-th and K-th
observations. When the variables are measured in different
units, it is necessary to prescale the variables to make
their values comparable or, equivalently, to compute a
weighted Euclidean distance

$$D(I,K) = \Bigl[ \sum_{1 \le J \le P} W(J) \{X(I,J) - X(K,J)\}^2 \Bigr]^{1/2}$$

This form of distance is not necessary if all variables are
measured on the same scale. However, even in this case,
weights might be used to increase or decrease the importance
of some variables. Various weighting schemes have been
utilized in practice. One common weighting scheme lets
W(J) be the reciprocal of the variance of variable J.
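
Both forms of the distance, together with the reciprocal-variance weighting, can be written directly from the definitions. The sketch below uses only the Python standard library; the function names and the small data matrix are hypothetical.

```python
import math
from statistics import variance


def euclidean_distance(X, i, k):
    """Equation (32): straight-line distance between rows i and k of X."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[k])))


def weighted_euclidean_distance(X, i, k, weights):
    """Weighted form: each squared difference is multiplied by W(J)."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, X[i], X[k])))


def reciprocal_variance_weights(X):
    """W(J) = 1 / (sample variance of variable J)."""
    return [1.0 / variance(column) for column in zip(*X)]


# Hypothetical data: crew size, weight (tons), speed (km/h).
X = [[4.0, 36.0, 50.0],
     [4.0, 52.0, 48.0],
     [3.0, 41.0, 65.0],
     [4.0, 60.0, 42.0]]

w = reciprocal_variance_weights(X)
print("unweighted D(0,1):", euclidean_distance(X, 0, 1))
print("weighted   D(0,1):", weighted_euclidean_distance(X, 0, 1, w))
```

Dividing each column by its standard deviation and then applying the unweighted distance gives the same value as the weighted distance with these weights, which is the equivalence between prescaling and weighting noted in the text.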

A general class of squared distance functions is
provided by utilizing positive definite quadratic forms.
Specifically, if $\beta$ represents a p-dimensional observation
to be assigned to one of s groups, then to measure the
squared distance between the observation $\beta$ and the
centroid (mean vector) $\bar{x}_i$ of the i-th group one may
consider the function

$$D_i = (\beta - \bar{x}_i)' M (\beta - \bar{x}_i)$$

where M is a positive definite matrix to ensure that
$D_i \ge 0$. Different distance functions are represented by
different choices of the matrix M. When M = I (the
identity matrix) the resulting metric is the standard
Euclidean distance. Distances with the Euclidean metric are
shown in Figure 1a. The variance within the data may make
the unweighted Euclidean metric inappropriate. As shown in
Figure 1b, where X has a larger variance than Y,
one may wish to weight a deviation in the X direction less
than an equal deviation in the Y direction. This is a
weighted Euclidean distance function which makes points A
and B equidistant from the origin. In this case, the
matrix M is diagonal, with elements which are the
reciprocals of the variances of the different variables.

Extending this idea further, one may consider the 
covariance among variables as well. Figure 1c shows how 
the axes may be rotated so that the major axis is oriented 
in a direction reflecting the positive correlation between 
X and Y. Again, points on the same ellipse are considered 
equidistant from the origin. The matrix M in this case is 
the inverse of the covariance matrix. 

A further extension of this concept leads to a generalized 
distance function. If $C_i$ represents the covariance matrix 
of the i-th cluster, then the distance function 

$$ D_i = (\beta - \bar{x}_i)^T C_i^{-1} (\beta - \bar{x}_i) $$

uses the appropriate covariance structure when determining 
the distance to a particular cluster centroid. Since $C_i$ 
changes to reflect the dispersion internal to each particular 
cluster, the use of this metric exploits differences in the 
dispersion characteristics of the different groups. As shown 
in Figure 1d, note how a new observation (denoted by u) is 
closer to the centroid of group one (G1) in terms of 
Euclidean distance but is more likely to be assigned to 
group two (G2) when using the $C_i^{-1}$ matrix. 

Figure 1. Euclidean Distance 

3. Mahalanobis Distance 

Another choice for the M matrix in the quadratic form 
is $P^{-1}$, where P represents the pooled within-groups 
covariance matrix of all the clusters, 

$$ P = \frac{W}{\sum_{i=1}^{G} (n_i - 1)} \tag{34} $$

where G is the number of groups, $n_i$ is the number of 
observations in group i, and 

$$ W = \sum_{k=1}^{G} W_k $$

is the sum of the within-group scatter matrices $W_k$. 
This distance is the well known Mahalanobis distance. Note 
that P does not change from group to group. To ensure 
the non-singularity of P it must be true that p ≤ (N - G), 
where N represents the total number of observations over 
all groups. Rewriting the distance, 

$$ D_i = (\beta - \bar{x}_i)^T W^{-1} (\beta - \bar{x}_i) \tag{35} $$

defines a distance between mean vectors $\beta$ and $\bar{x}_i$ 
with common covariance matrix W. The Mahalanobis distance 
function adjusts for both scale of measurement of the 
variables and covariation among the variables. Use of this 
metric is equivalent to computing distances on variables 
transformed to their principal components. This metric is 
invariant under any nonsingular transformation of the 
original variables. For consider the transformation 



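$$ Y = BX \tag{36} $$

and let $D(Y_i, Y_j)$ represent the Mahalanobis distance 
between $Y_i$ and $Y_j$. Then 

$$
\begin{aligned}
D(Y_i, Y_j) &= (Y_i - Y_j)^T P_Y^{-1} (Y_i - Y_j) \\
            &= (BX_i - BX_j)^T P_Y^{-1} (BX_i - BX_j) \\
            &= (X_i - X_j)^T B^T (B P_X B^T)^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)^T B^T (B^T)^{-1} P_X^{-1} B^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)^T P_X^{-1} (X_i - X_j) \\
            &= D(X_i, X_j)
\end{aligned}
$$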
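
The invariance argued above can be illustrated numerically. The sketch below is an assumption-laden simplification: it works in two dimensions only, uses the ordinary sample covariance matrix of a single group in place of the pooled matrix P, and all names, data points, and the matrix B are hypothetical.

```python
def covariance_2d(data):
    """Sample covariance matrix of two-dimensional observations."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def mahalanobis_sq(x, y, cov):
    """Quadratic form (x - y)' cov^{-1} (x - y)."""
    inv = inverse_2x2(cov)
    d0, d1 = x[0] - y[0], x[1] - y[1]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))


def transform(b, x):
    """Apply Y = BX to a single observation."""
    return (b[0][0] * x[0] + b[0][1] * x[1],
            b[1][0] * x[0] + b[1][1] * x[1])


# Hypothetical observations and a hypothetical nonsingular matrix B.
X = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 7.0)]
B = [[2.0, 1.0], [0.0, 3.0]]
Y = [transform(B, x) for x in X]

d_x = mahalanobis_sq(X[0], X[3], covariance_2d(X))
d_y = mahalanobis_sq(Y[0], Y[3], covariance_2d(Y))
print(abs(d_x - d_y) < 1e-9)   # the two distances agree up to rounding
```

The covariance matrix is recomputed from the transformed data, which is what makes the two quadratic forms agree; holding the original covariance matrix fixed would not give an invariant distance.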



Some other common metrics are listed below, where P is the 
number of variables: 

1. $L_1$ norm (City Block) 

$$ D(X_i, X_j) = \sum_{k=1}^{P} |x_{ik} - x_{jk}| $$

2. $L_p$ norm (Minkowski Metrics) 

$$ D(X_i, X_j) = \left( \sum_{k=1}^{P} |x_{ik} - x_{jk}|^p \right)^{1/p} $$

3. Uniform norm 

$$ D(X_i, X_j) = \sup_{k = 1, 2, \ldots, P} |x_{ik} - x_{jk}| $$
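
The three metrics listed above can be sketched in a few lines. The function names and the two sample points are illustrative assumptions; the last line suggests how the Minkowski distance approaches the uniform norm as the exponent grows.

```python
def city_block(x, y):
    """L1 norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """Lp norm with exponent p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def uniform(x, y):
    """Uniform norm: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical pair of two-dimensional observations.
x, y = (0.0, 0.0), (3.0, 4.0)

print(city_block(x, y))
print(minkowski(x, y, 2))    # exponent 2 is the Euclidean distance
print(uniform(x, y))
print(minkowski(x, y, 50))   # close to the uniform norm for a large exponent
```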

C. HIERARCHICAL CLUSTERING 
1. General 

The previously discussed distance measures may be 
used to construct a similarity matrix describing the strength 
of all pairwise relationships among the entities (variables 
or data units) in the data set. The methods of hierarchical 
cluster analysis operate on this similarity matrix to 
construct a tree depicting specified relationships among the 
entities. As shown in Figure 2, the branches on the left 
each represent one entity while the root represents the 
entire collection of entities. Moving through the tree 
from the branches toward the root depicts increasing 
aggregation of the entities into clusters. Hierarchical 
clustering methods which build a tree from branches to 
root often are called agglomerative methods; methods which 
start at the root and split it toward the branches are 
called divisive methods. 

Once a tree is constructed for N entities, the 
analyst may choose from as many as N sets of clusters. 
These clusters are nested. From the agglomerative view, 
when two entities are merged they are joined together 
permanently and considered as one entity for later merges; 
from the divisive view, when a group of entities is split 
into two parts, the parts are separated permanently and may 
be treated independently for the remainder of the analysis. 

Figure 2. Tree for Hierarchical Clustering 



Herein lie both the strength and weakness of 
hierarchical methods: by taking early decisions as 
permanent, the number of possibilities that need be examined 
is reduced greatly as compared with complete enumeration; 
but this same convention precludes correcting early 
mistakes or capitalizing on later opportunities. 

There are three major hierarchical clustering concepts: 

1. Linkage Methods 

2. Centroid Methods 

3. Error sum of squares or variance methods. 



All of these methods are suitable for clustering data units. 
However, only the linkage methods are considered in this 
research. 

2. The General Agglomerative Procedure 

Let $s_{ij}$ be the similarity between entities i 
and j as defined by one of the distance measures previously 
discussed. Assuming that the similarity is symmetric, the 
complete schedule of similarities for all 
$\binom{N}{2} = \frac{1}{2}N(N-1)$ possible pairwise 
combinations of entities may be arrayed in a lower 
triangular similarity matrix as in Figure 3. 

The $s_{ij}$ entries are nonnegative. This limitation is of 
consequence only for correlation and the cosine of the angle 
between vectors; the distinction between positive and 
negative association cannot be utilized in these clustering 
methods. 




Figure 3. Lower Triangular Similarity Matrix 

A simple remedy is to use the absolute value or the square of 
the measure if it can assume negative values. Once the 
matrix is defined, the process of clustering entities is 
almost trivially simple. The general procedure for agglomer- 
ative clustering on a data matrix is as follows: 

(1) Begin with N clusters each consisting of 
exactly one entity. Let the clusters be 
labeled with the numbers 1 through N. 

(2) Search the similarity matrix for the most 
similar pair of clusters. Let the chosen 
clusters be labeled p and q and let 
their associated similarity be $s_{pq}$, p > q. 

(3) Reduce the number of clusters by 1 through 
merger of clusters p and q. Label the 
product of the merger q and update the 
similarity matrix entries in order to 
reflect the revised similarities between 
cluster q and all other existing clusters. 
Delete the row and column of S pertaining 
to cluster p. 

(4) Perform steps 2 and 3 a total of N-1 times 
(at which point all entities will be in one 
cluster). At each step record the identity 
of the clusters which are merged and the value 
of similarity between them in order to have 
a complete record of the results. 



Different agglomerative methods are implemented by 
varying the procedures used for defining the most similar 
pair at step 2 and for updating the revised similarity 
matrix at step 3. The similarity matrix is a given array 
of numbers. The numerical execution of the clustering 
procedures is completely independent of how the similarity 
values were generated or whether the entities to be 
clustered are variables or data units. However, it is 
necessary to make a direct distinction between distance-like 
measures (the smallest values correspond to the most similar 
pairs) and correlation-like measures (the largest values 
correspond to the most similar pairs); the essential 
difference is whether the search for the most similar pair 
involves seeking the minimum or maximum entry in the 
similarity matrix. 
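
The four steps of the general procedure can be sketched as follows. This is an illustrative sketch and not the program used in this research: it assumes a distance-like measure (the most similar pair has the smallest entry), and the function name, the `update` parameter, and the four sample positions are hypothetical. Taking the smaller or the larger of the two old entries are two simple updating rules for step 3.

```python
def agglomerate(dist, update=min):
    """General agglomerative procedure on a symmetric distance-like matrix.

    dist   : full N x N matrix (list of lists) of pairwise distances
    update : rule combining the old entries s[p][r] and s[q][r]
    Returns the merge record [(p, q, value), ...] with p > q.
    """
    n = len(dist)
    # Step 1: N clusters labeled 1 through N, one entity each.
    s = {i + 1: {j + 1: dist[i][j] for j in range(n) if j != i}
         for i in range(n)}
    history = []
    for _ in range(n - 1):                       # step 4: N - 1 merges
        # Step 2: most similar pair (smallest distance), with p > q.
        p, q = min(((a, b) for a in s for b in s[a] if a > b),
                   key=lambda pair: s[pair[0]][pair[1]])
        value = s[p][q]
        # Step 3: merge p into q and revise the entries for cluster q.
        for r in list(s):
            if r != p and r != q:
                revised = update(s[p][r], s[q][r])
                s[q][r] = revised
                s[r][q] = revised
        del s[p]                                 # delete row p
        for row in s.values():
            row.pop(p, None)                     # delete column p
        history.append((p, q, value))
    return history


# Hypothetical entities located on a line at these positions.
positions = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(a - b) for b in positions] for a in positions]

print(agglomerate(dist, update=min))
print(agglomerate(dist, update=max))
```

For a correlation-like measure the same skeleton applies with the search in step 2 seeking the maximum entry instead of the minimum.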

3. Single Linkage 

The method of single-linkage cluster analysis is 
the simplest of all hierarchical techniques. At each stage, 
after clusters p and q have been merged, the similarity 
between the new cluster (labeled t) and some other cluster 
r is determined as follows: 

1. If $s_{ij}$ is a distance-like measure, 

$$ s_{tr} = \min(s_{pr}, s_{qr}) \tag{37} $$

The quantity $s_{tr}$ is the distance between the two closest 
members of clusters t and r. If clusters t and r 
were to be merged, then for any entity in the resulting 
cluster the distance to its nearest neighbor would be at 
most $s_{tr}$. 

2. If $s_{ij}$ is a correlation-like measure, 

$$ s_{tr} = \max(s_{pr}, s_{qr}) \tag{38} $$

The quantity $s_{tr}$ is the similarity between the two most 
similar entities in clusters t and r. If clusters t 
and r were to be merged, then for any entity in the 
resulting cluster there would be at least one other entity 
in the same cluster such that the pair would have a 
similarity at least as large as $s_{tr}$. 

The method is known as single linkage because clusters 
are joined at each stage by the single shortest or strongest 
link between them. Since the updating process involves 
choosing only the minimum or maximum, single-linkage 
clustering is invariant to any transformation which leaves 
the ordering of the similarities unchanged; that is, any 
monotonic transformation. 



4. Complete Linkage 

The complete-linkage method is related to the 
single-linkage method and is no more difficult to execute. 
At each stage, after clusters p and q have been merged, the 
similarity between the new cluster (labeled t) and some 
other cluster r is determined as follows: 

1. If $s_{ij}$ is a distance-like measure, 

$$ s_{tr} = \max(s_{pr}, s_{qr}) \tag{39} $$

The quantity $s_{tr}$ is the distance between the most distant 
members of clusters t and r. If clusters t and r 
were merged, then every entity in the resulting cluster 
would be no farther than $s_{tr}$ from every other entity in 
the cluster. The value of $s_{tr}$ is the diameter of the 
smallest sphere which can enclose the cluster resulting from 
the merger of clusters t and r. 

2. If $s_{ij}$ is a correlation-like measure, 

$$ s_{tr} = \min(s_{pr}, s_{qr}) \tag{40} $$

The quantity $s_{tr}$ is the similarity between the two most 
dissimilar entities in clusters t and r. If clusters 
t and r were to be merged, then every entity in the 
resulting cluster would have a similarity of at least $s_{tr}$ 
with every other entity in the cluster. 

The method is called complete linkage because all 
entities in a cluster are linked to each other at some 
maximum distance or minimum similarity. Such a cluster 
is called a "maximally connected subgraph" in graph theory. 
In contrast to the single-linkage method, interpretation 
of the clusters can be made only in terms of the 
relationships within individual clusters; there is no 
particularly useful interpretation involving the differences 
between clusters. Like the single-linkage method, 
complete-linkage cluster analysis is invariant to monotonic 
transformations of the similarity measure. Johnson (1967) 
discusses this property in both single and complete linkage 
methods. 
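
The updating rules of the two linkage methods differ only in whether the minimum or the maximum of the two old entries is kept, and the choice reverses between distance-like and correlation-like measures. A minimal sketch that summarizes the four cases follows; the function name and its arguments are hypothetical.

```python
def updated_similarity(s_pr, s_qr, linkage, distance_like):
    """Similarity between the merged cluster t = (p, q) and another cluster r.

    linkage       : "single" or "complete"
    distance_like : True if small values mean similar, False if large values do
    """
    if linkage == "single":
        # Closest members (distance) or most similar members (correlation).
        return min(s_pr, s_qr) if distance_like else max(s_pr, s_qr)
    if linkage == "complete":
        # Most distant members (distance) or least similar members (correlation).
        return max(s_pr, s_qr) if distance_like else min(s_pr, s_qr)
    raise ValueError("linkage must be 'single' or 'complete'")


print(updated_similarity(2.0, 5.0, "single", distance_like=True))
print(updated_similarity(2.0, 5.0, "complete", distance_like=True))
print(updated_similarity(0.3, 0.8, "single", distance_like=False))
print(updated_similarity(0.3, 0.8, "complete", distance_like=False))
```

Because only an ordering of the entries is used, applying any increasing function to all entries leaves every choice made by this rule unchanged.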

D. NONHIERARCHICAL CLUSTERING 

Nonhierarchical clustering methods are designed to 
cluster data units into a single classification of g 
clusters, where g either is specified a priori or is 
determined as a part of the clustering method. The central 
idea in most of these methods is to choose some initial 
partition of the data units and then alter cluster 
memberships so as to obtain a better partition. The various 
algorithms which have been proposed differ as to what 
constitutes a "better partition" and what methods may be used 
for achieving improvements. 

The broad concept for these methods is very similar 
to that underlying the steepest descent algorithms used 
for unconstrained optimization in nonlinear programming. 

Such algorithms begin with an initial point and then 
converge to a local optimum, moving one step at a time, 
the value of the objective function improving at each step. 

The methods of nonhierarchical clustering typically 
may be used with much larger problems than the hierarchical 
methods because it is not necessary to calculate and store 
the similarity matrix; it is not even necessary to store 
the data set. In general, the data units are processed 
serially and can be read from tape or disk as needed. This 
characteristic makes it possible, at least in principle, to 
cluster arbitrarily large collections of data units. 

In this research, we consider only the partitioning 
method known as "K-MEANS" which was developed by MacQueen 
(15). He used the term "K-MEANS" to denote the process of 
assigning each data unit to that cluster (of k clusters) 
with the nearest centroid (mean vector). The cluster 
centroid changes with each transfer of an observation. 
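
A single serial pass of this kind can be sketched as follows. This is an illustrative sketch only: it assumes that the first k data units serve as the initial centroids and that squared Euclidean distance decides which centroid is nearest; the function name and the sample data are hypothetical.

```python
def one_pass_kmeans(data, k):
    """Assign each data unit in input order to the nearest centroid,
    updating that centroid (a running mean) after every assignment."""
    centroids = [list(x) for x in data[:k]]   # assumption: first k units seed the clusters
    counts = [1] * k
    labels = list(range(k))
    for x in data[k:]:
        nearest = min(
            range(k),
            key=lambda c: sum((a - m) ** 2 for a, m in zip(x, centroids[c])),
        )
        counts[nearest] += 1
        # Running-mean update: new mean = old mean + (x - old mean) / count.
        centroids[nearest] = [m + (a - m) / counts[nearest]
                              for m, a in zip(centroids[nearest], x)]
        labels.append(nearest)
    return labels, centroids


# Hypothetical two-dimensional data units.
data = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0), (9.0, 9.0), (0.0, 2.0), (11.0, 9.0)]
labels, centroids = one_pass_kmeans(data, 2)
print(labels)
print(centroids)
```

Only the current centroids and counts are kept in memory, which is why the data units can be read serially; the result, however, depends on the order in which they arrive.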

The decomposition of the total scatter matrix T into 
within-groups and between-groups matrices W and B suggests 
possible optimality criteria to be used in a clustering 
algorithm. One would like the within-groups scatter to be 
small relative to the between-groups scatter. Various trial 
clusterings could be formed using the W and B matrices 
as a basis for the optimality criteria which determine the 
best clustering. A possible choice for a criterion is to 
minimize trace W over all partitions into g groups. 
Since T is constant over all partitions, minimizing trace 
W is equivalent to maximizing trace B, since 

$$ \operatorname{trace} T = \operatorname{trace} W + \operatorname{trace} B \tag{41} $$

Although trace W is invariant under an orthogonal 
transformation, it is not invariant under other non-singular 
linear transformations. 

McRae (16) points out that trace W equals the total 
within-group sum of squares, hence the "minimum variance 
partition" cluster solution is found by minimizing trace W. 
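
Because the trace of a scatter matrix is a sum of squared deviations, equation (41) can be checked without forming the matrices themselves. The sketch below uses hypothetical data and function names; it computes the three traces for two different partitions of the same data.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sum_of_squares(rows, center):
    """Sum of squared deviations of the rows from a center (a trace)."""
    return sum((a - c) ** 2 for row in rows for a, c in zip(row, center))


def scatter_traces(data, labels):
    """Return (trace T, trace W, trace B) for a given partition."""
    grand = mean_vector(data)
    groups = {}
    for row, g in zip(data, labels):
        groups.setdefault(g, []).append(row)
    trace_t = sum_of_squares(data, grand)
    trace_w = sum(sum_of_squares(rows, mean_vector(rows))
                  for rows in groups.values())
    trace_b = sum(len(rows) * sum_of_squares([mean_vector(rows)], grand)
                  for rows in groups.values())
    return trace_t, trace_w, trace_b


# Hypothetical data units and two candidate partitions into g = 2 groups.
data = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0), (10.0, 10.0), (9.0, 9.0), (11.0, 9.0)]
compact = scatter_traces(data, [0, 0, 0, 1, 1, 1])
mixed = scatter_traces(data, [0, 1, 0, 1, 0, 1])

print(compact)
print(mixed)
```

Trace T is the same for both partitions, so the partition with the smaller trace W necessarily has the larger trace B.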

Considerable study has been devoted to alternative 
criteria such as those based on multivariate statistical 
analysis techniques, especially the methods of linear 
discriminant analysis and multivariate analysis of variance. 
Assuming the p variables are not linearly dependent, then 
as long as p ≤ N - g, W is positive definite symmetric 
and so is $W^{-1}$. Attempts to make B and W as different 
as possible lead one to solving the determinantal equation: 

$$ |B - \lambda W| = 0 \tag{42} $$

The solutions are the eigenvalues of the matrix $W^{-1}B$, 
as in discriminant analysis. There are t non-zero 
eigenvalues, where t is the minimum of p and g-1. 
This is a consequence of the fact that, if g is less than 
p, the g group means are contained in a (g-1)-dimensional 
hyperplane. When g = 2 the analysis is equivalent to 
two-group discriminant analysis. Linear discriminant 
analysis would take the vectors originally described in a 
p-dimensional coordinate system and transform the basis to a 
t-dimensional system. Maximizing the largest of these 
eigenvalues is a criterion suggested by S.N. Roy; 
maximizing the trace of $W^{-1}B$, however, is a criterion 
suggested by Hotelling. In both cases, large values for 
these statistics are sought in clustering algorithms since 
large values indicate large differences among (between) 
groups. Minimizing the ratio of determinants |W| / |T| 
is a criterion widely known as Wilks' lambda, discussed in 
the discriminant analysis. Since T is the same for all 
partitions, this criterion is equivalent to minimizing 
determinant W. Both trace $W^{-1}B$ and |T| / |W| may 
be expressed in terms of the eigenvalues $\lambda_i$ of 
$W^{-1}B$: 

$$ \frac{|T|}{|W|} = \prod_{i=1}^{t} (1 + \lambda_i) \tag{43} $$

$$ \operatorname{trace} W^{-1}B = \sum_{i=1}^{t} \lambda_i \tag{44} $$

where t = min(p, g-1). Therefore minimizing det W is 
equivalent to maximizing $\prod_{i=1}^{t} (1 + \lambda_i)$. 

Friedman and Rubin (6) describe the advantages of the 
various criteria. Those based on multivariate statistical 
considerations (all but trace W) are invariant under 
non-singular linear transformations of the variables, 
including changes in scale. In fact, they are the only 
invariants for W and B under such transformations. In 
addition, the multivariate criteria may take into account 
covariation among the variables. 

V. ANALYSIS OF MULTIVARIATE UTILITY DATA 



To illustrate hierarchical clustering we applied the 
technique described in the previous chapter to partition a 
set of twenty-six attributes of a close-air support weapon 
system into a smaller collection of "superattributes". As 
part of an effort to evaluate the military utility of a 
proposed alternative U.S. Marine Corps air support radar 
system, the AN/TPQ-27, Barr and Richards (4) extracted 26 
attributes of the TPQ-27 and a baseline system, the 
AN/TPQ-10, and then had members of the Operational Test and 
Evaluation Team assess the utility of the TPQ-27 relative to 
that of the TPQ-10. In order that the additive model used to 
combine unidimensional relative utilities into a system 
relative utility be justifiable, it is necessary that the 
utilities satisfy certain independence properties described 
in Keeney and Raiffa (12). 

Because those independence properties are very 
difficult for decision makers to verify for complex 
alternatives like the weapon systems under study, Professors 
Barr and Richards attempted instead to work with the 
attributes to try to generate a new collection which would 
likely satisfy, at least approximately, the conditions 
required to justify the additive model. 

The original collection of 26 attributes is as follows: 

1. Portability 
2. Durability 
3. Time to Set Up 
4. Time to Take Down 
5. Ease of Assigning Aircraft to Targets 
6. Number of Aircraft Controlled 
7. Number of Targets 
8. Communications 
9. Mission Flexibility 
10. ASRT Survivability 
11. Time to Locate and Acquire Aircraft 
12. Accuracy of Tracking 
13. Accuracy of Delivery 
14. Range 
15. Aircraft Vulnerability 
16. Aircraft Attack Throughput 
17. Ease of Adjustment and Evaluation of Results 
18. Accuracy of Feedback 
19. Ease of Operation 
20. Man-Machine Compatibility 
21. Training Requirements 
22. Reliability 
23. Maintainability 
24. Supportability 
25. Availability 
26. Documentation 
where $a_i$ represents attribute i and 

$$ \delta(x_{ij}, x_{kj}) = \begin{cases} 1 & \text{if } x_{ij} = x_{kj} \\ 0 & \text{if } x_{ij} \ne x_{kj} \end{cases} $$


It is easy to verify that D is a metric as defined in 
Chapter IV. Since we will actually work with a similarity 
measure in the hierarchical cluster procedure, we define 
the similarity between two attributes $a_i$ and $a_k$ as 

$$ S(a_i, a_k) = \sum_{j=1}^{12} \delta(x_{ij}, x_{kj}) \tag{46} $$

One can see from this definition that the similarity between 
two attributes $a_i$ and $a_k$ is simply the number of team 
members who placed attributes $a_i$ and $a_k$ in the same 
group. For example, 

S(a_1, a_2) = 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 0 = 4 
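
The counting behind S and D can be sketched as follows. The table below is a small hypothetical example with three attributes and four team members, not the data of this study, and the function names are illustrative.

```python
def similarity(row_i, row_k):
    """Number of team members who put both attributes in the same group."""
    return sum(1 for a, b in zip(row_i, row_k) if a == b)


def distance(row_i, row_k):
    """Number of team members who put the attributes in different groups."""
    return len(row_i) - similarity(row_i, row_k)


def lower_triangle(table):
    """Similarity for every attribute pair (i, k) with i > k."""
    ids = sorted(table)
    return {(i, k): similarity(table[i], table[k])
            for i in ids for k in ids if i > k}


# Hypothetical group numbers: table[attribute] lists the group chosen by
# each of four team members.
table = {
    1: [1, 2, 1, 3],
    2: [1, 2, 2, 3],
    3: [2, 1, 2, 1],
}

print(lower_triangle(table))
print(distance(table[1], table[2]))
```

The group numbers act only as labels: two attributes agree for a team member when that member gave them the same label, whatever the label is.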



Either S or D can be used in the computer program 
shown in Appendix A for hierarchical clustering. One need 
only indicate whether a correlation-like measure (larger 
values imply more similar) or a distance-like measure 
(smaller values imply more similar) is wanted. We chose 
the former. The similarity matrix extracted from 
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Table II. Data Matrix 



Attribute                     Team Member 
            1   2   3   4   5   6   7   8   9  10  11  12 

    1       6   2   1   3   2   4   3   1   3   1   1   1 
    2       1   2   2   3   1   3   2   1   1   5   1   2 
    3       6   1   1   5   2   4   3   1   6   1   2   1 
    4       6   1   1   5   2   5   3   1   6   1   2   1 
    5       2   5   3   1   3   1   1   2   2   2   3   3 
    6       2   7   4   1   4   1   1   2   4   2   3   3 
    7       2   7   4   1   4   1   1   2   4   2   3   3 
    8       3   5   4   7   5   6   1   3   1   3   3   6 
    9       2   5   4   1   3   1   1   2   4   2   3   3 
   10       9   6   5   6   5   7   8   3   7   3   2   4 
   11       2   8   6   4   3   2   1   4   4   2   3   3 
   12       3   8   7   4   4   2   6   5   4   6   4   3 
   13       3   8   7   4   4   2   6   5   4   6   4   3 



14 3 8 

15 7 6 

16 8 8 

17 8 8 

18 4 8 

19 4 5 

20 4 5 

21 5 4 

22 1 3 

23 1 3 

24 1 3 

25 1 3 

26 5 4 



4 4 4 

5 6 5 

6 14 

8 4 3 

8 4 6 

3 13 

3 3 3 

9 2 3 

2 3 1 

9 3 1 

4 3 1 

2 3 1 

9 2 6 



2 



3 



7 7 3 7 

17 4 4 

2 5 6 5 

2 5 6 5 

112 2 

3 3 1 2 

8 4 7 2 

3 2 7 1 

3 2 7 1 



3 7 4 

2 7 3 

4 5 3 

6 5 3 

4 3 3 

4 8 3 

4 6 5 

5 6 2 

5 6 2 



3 2 7 

3 2 7 

8 4 7 



3 5 6 2 

1 5 6 2 

2 4 6 5 
from the data is shown in Table III. We present only the lower 
triangular elements since $S(a_i, a_i) = 12$ for all i and 
the matrix is symmetric; i.e., $S(a_i, a_k) = S(a_k, a_i)$. 
Zero values are not written. 

The results from the hierarchical clustering are shown 
in Figure 4. The numbers printed along the left-hand 
margin refer to the attribute numbers. As you proceed to 
the right through the tree you will observe numbers greater 
than 26. These correspond to the clusterings that take 
place from one step to the next. For example, the number 
27 shown at the juncture of 25 and 22 means that the first 
attributes clustered together are 25 and 22 (this is 
the most similar pair). This combination is then considered 
as a new attribute which is later combined with the attribute 
30 (itself a combination of 23 and 24) to form the attribute 
31. This is later combined with attribute 2 to form 
attribute 40, etc. 

As discussed in Chapter IV, a decision has to be made as 
to how many clusters (superattributes) are desired. All 
hierarchical methods will continue clustering until there 
is a single cluster. In order to decide on the number of 
clusters (and their composition) one need only imagine 
drawing a vertical line through the tree at various places. 
Each intersection of the tree with the vertical line results 
in a cluster. For example, the vertical line at the point A 
results in the 6 clusters shown in Table IV. 
It is clear from observing the above collection that 
some of the attributes are highly correlated and redundant. 
If one tries to assign an importance weight to each 
attribute separately, there is a distinct likelihood that 
some of the strongly overlapping, related attributes 
might effectively be double or triple weighted or more, 
producing a biased result. In an effort to prevent this 
from happening, Barr and Richards asked the utility 
assessment team to partition the 26 attributes into a smaller 
collection in such a way that attributes within a group are 
similar and attributes in different groups are unrelated in 
the sense that utility assessments for attributes in one 
group do not depend on the amounts of attributes in any 
other group. 

The total number of groups was not prespecified. 
Instead, each team member was allowed to partition the 26 
attributes into any number of groups. The resulting 
multivariate data array is shown in Table II. An element 
$x_{ij}$ is the number of the group into which team member j 
put attribute i. 

Let us define a distance measure for this data array 
as follows: 

$$ D(a_i, a_k) = \sum_{j=1}^{12} \left( 1 - \delta(x_{ij}, x_{kj}) \right) \tag{45} $$

Table III. Similarity Matrix for Superattribute Determination 

The superattributes used in the utility study are those 
shown in Table IV. A careful examination of the attributes 
which compose the clusters shows that the results so obtained 
are intuitively agreeable. The names supplied to the 
superattributes are somewhat natural descriptions of the 
clusters obtained. 

The listing of the computer program and sample output 
are given in Appendix A. 
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Table IV. 



Superattributes 

Superattribute                  Component Attributes 

Facility of movement             1. Portability 
                                 3. Time to set up 
                                 4. Time to take down 

Facility of use (precision)      5. Ease of assigning aircraft to targets 
                                 6. Number of aircraft controlled 
                                 7. Number of targets 
                                 9. Mission flexibility 
                                11. Time to locate and acquire aircraft 
                                16. Aircraft attack throughput 
                                17. Ease of adjustment 
                                18. Accuracy of feedback 
                                19. Ease of operation 
                                20. Man-machine compatibility 
                                12. Accuracy of tracking 
                                13. Accuracy of delivery 
                                14. Range 

Survivability                   10. ASRT survivability 
                                15. Aircraft vulnerability 

Learning                        21. Training requirements 
                                26. Documentation 

Readiness                        2. Durability 
                                22. Reliability 
                                23. Maintainability 
                                24. Supportability 
                                25. Availability 

Communications                   8. Communications 
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VI. ANALYSIS OF ARMY TANK DATA 



A. DATA STRUCTURE 

In order to illustrate the nonhierarchical clustering 
methodology, principal components analysis, and discriminant 
analysis, data on Army tanks from eight different countries 
were taken from Jane's Book of Weapon Systems (1979-80). 

A total of twenty-four tanks were included in the data 
array, with observations on each of 10 variables. The 10 
variables are listed below: 



1. Weight (ton) 
2. Length (meter) 
3. Width (meter) 
4. Height (meter) 
5. Road Speed (kilometer per hour) 
6. Trench Crossing (meter) 
7. Ground Pressure (kg/cm^2) 
8. Maximum Armament (rounds) 
9. Ground Clearance (meter) 
10. Power to Engine Ratio (BHP/ton) 



The twenty-four tanks and the associated countries are 
shown below: 

Identification Number    Type/Name           Country 

        11               T-62                U.S.S.R. 
        12               T-54                U.S.S.R. 
        13               T-10                U.S.S.R. 
        14               ASU-85              U.S.S.R. 
        15               MK-5/Chieftain      U.K. 
        16               MK-3/Vickers        U.K. 
        17               MK-13/Centurion     U.K. 
        18               CVR(T)/Scorpion     U.K. 
        19               XM-1                U.S.A. 
        20               M60A2               U.S.A. 
        21               M60                 U.S.A. 
        22               M48                 U.S.A. 
        23               M47                 U.S.A. 
        24               PZ61                SWITZERLAND 
        25               PZ68                SWITZERLAND 
        26               STRV-103            SWEDEN 
        27               Ikv-91              SWEDEN 
        28               TYPE61              JAPAN 
        29               TYPE74              JAPAN 
        30               Leopard 2           W. GERMANY 
        31               Leopard             W. GERMANY 
        32               TAM                 W. GERMANY 
        33               AMX 30              FRANCE 
        34               AMX 13              FRANCE 

We conjecture that a cluster analysis of the tank data will 
result in clusters corresponding to nationality since the 
nations may have different emphasis on the variables in 
the design of their tanks. 

B. NONHIERARCHICAL CLUSTER ANALYSIS OF TANK DATA 
1. The MIKCA Algorithm 

The specific algorithm chosen for the nonhierarchical 
cluster analysis of the tank data is the MIKCA (Multivariate 
Iterative K-MEANS Clustering Algorithm) program written by 
Douglas J. McRae as a part of his doctoral dissertation 
at the University of North Carolina, Chapel Hill. 

Reference to the flow chart in Figure 5 will aid the 
reader in following the discussion of the algorithm. Inputs 
to the program are the data matrix, an estimate for g (the 
number of clusters), and a choice of criterion and distance 
functions. 

In the first step, preliminary calculations are made, 
such as the variable means and standard deviations, as 
well as the cross product matrix T. The next step forms 
the initial cluster centers. Then each of the other 
observations is assigned to the nearest cluster. Euclidean 
distance is used for this initial phase, and the cluster 
centroids are recomputed after each observation is assigned 
to a group. The observations are considered in the same 
order as they were input. After all of them have been 
assigned to clusters, the criterion value is computed. 
This initial cluster-finding technique is referred to as a 
one-pass K-MEANS procedure. It is performed three times, 
and the solution which yields the best criterion value is 
chosen as the initial cluster solution. 

Figure 5. MIKCA Flow Chart 

After the initial solution has been found, the program 
advances to the iterative K-MEANS phase where the observations 
are again considered in the order in which they were input to 
the program. It is in this phase that the user's choice of 
distance function is used. The distance from each 
observation to each cluster centroid is again computed, this 
time with the user's distance function, the assignment to the 
closest centroid being made and the centroid updated to 
reflect its new membership. After considering all N 
observations in this manner, the new criterion value is 
checked for possible improvement during the K-MEANS iteration. 
As long as the criterion value improves, the K-MEANS 
procedure is repeated; if the criterion fails to improve then 
the MIKCA algorithm goes to the next step, the individual 
switches section. Note the importance of the order of 
consideration of the observations. The order is important 
because the cluster means are recomputed after each 
observation is reassigned. 

In the individual switches phase, consideration is given 
to moving each observation to every other cluster, the move 
being made if and only if an improvement in the value of 
the criterion results. An elaborate labelling procedure 
provides a unique order in which to consider each observation. 
This procedure continues until a complete pass through the 
data is made with no changes in cluster membership. 
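
The individual switches phase can be sketched as follows. This is a simplified illustration and not the MIKCA program itself: it assumes the minimum trace W criterion, considers observations in input order instead of using the labelling procedure mentioned above, and recomputes the criterion from scratch after every trial move. The function names and data are hypothetical.

```python
def trace_w(data, labels, g):
    """Total within-group sum of squares for a partition into g clusters."""
    total = 0.0
    for c in range(g):
        rows = [x for x, lab in zip(data, labels) if lab == c]
        if not rows:
            continue                      # an empty cluster contributes nothing
        center = [sum(col) / len(rows) for col in zip(*rows)]
        total += sum((a - m) ** 2 for x in rows for a, m in zip(x, center))
    return total


def individual_switches(data, labels, g):
    """Move single observations between clusters while the criterion improves."""
    labels = list(labels)
    best = trace_w(data, labels, g)
    changed = True
    while changed:                        # stop after a pass with no changes
        changed = False
        for i in range(len(data)):
            for c in range(g):
                if c == labels[i]:
                    continue
                previous = labels[i]
                labels[i] = c             # trial move
                value = trace_w(data, labels, g)
                if value < best - 1e-12:
                    best = value          # keep the move: criterion improved
                    changed = True
                else:
                    labels[i] = previous  # undo the move
    return labels, best


# Hypothetical data with one observation initially in the wrong cluster.
data = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0), (10.0, 10.0), (9.0, 9.0), (11.0, 9.0)]
final_labels, final_value = individual_switches(data, [0, 0, 1, 1, 1, 1], 2)
print(final_labels, final_value)
```

Since every accepted move strictly lowers the criterion and the number of partitions is finite, the loop must terminate, although only at a local optimum.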

The MIKCA algorithm provides the following options for 
distance and criterion functions. 

Criterion 

1. Minimum trace W 

2. Minimum determinant W 

3. Maximum largest root of |B - λW| = 0 

4. Maximum sum of roots of |B - λW| = 0 

Distance 

1. Euclidean 

2. Weighted Euclidean 

3. Mahalanobis 

A complete computer program is listed in Appendix B. 

2 . Cluster Results for Tank Data 

For clustering of the tank data we selected the 
minimum trace W criterion and the weighted Euclidean 
distance function. The algorithm automatically provides 
weights for the weighted Euclidean distance function. 

The results of the clustering with four clusters are shown 
in Table V. 

The conjecture of clustering by nationalities is 
supported by the results. The three Soviet tanks make up 
one cluster and the two British and four of the United 
States tanks were found to be similar. A third cluster 
consists of four tanks which are very lightweight. The 
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final cluster consists of the rest of the tanks, including 
tanks of United States allies from West Germany, France, 
Sweden, Switzerland and Japan. 

A natural question to ask after observing the results 
of a cluster analysis is what variables most strongly 
influence the clustering that was observed. A clue is 
provided by the composition of the cluster containing all of 
the lightweight tanks. This suggests that weight is an 
important distinguishing feature. This is examined in the 
principal components analysis and the discriminant analysis 
in the next two sections.

C. PRINCIPAL COMPONENTS ANALYSIS 

The Statistical Package for the Social Sciences (SPSS) (14)
subprogram FACTOR was used for the principal components
analysis. It is designed for both factor analysis and
principal components analysis, and its output is designed
to be self-explanatory. In this example, the first 5
components account for 90% of the variance and the remaining
components account for only 10% of the variance (Figure 6).
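
The "first 5 components account for 90%" statement is a cumulative proportion of eigenvalues. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the eigenvalues in the usage note are made up for illustration and are not taken from Figure 6.

```python
def cumulative_proportions(eigenvalues):
    """Cumulative share of total variance, largest eigenvalue first."""
    total = sum(eigenvalues)
    running = 0.0
    shares = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative share reaches threshold."""
    shares = cumulative_proportions(eigenvalues)
    for count, share in enumerate(shares, start=1):
        if share >= threshold - 1e-12:  # tolerance for rounding error
            return count
    return len(eigenvalues)
```

For example, with the hypothetical eigenvalues 4.0, 2.0, 1.5, 1.0, 0.5, 0.4, 0.3, 0.15, 0.1, 0.05 (which sum to 10, as they would for ten standardized variables), `components_needed(..., 0.90)` returns 5.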

The subprogram FACTOR provides a graphical presentation
(Figure 7) of the factors determined by the orthogonal
rotation (in this example, a variance maximization
rotation). In reading the graphs, one should be attentive
to the following three features: (1) the relative distance of
a variable from the axis, (2) the direction of a variable
in relation to the axis, and (3) the clustering of
variables and their positions relative to each other.

Figure 6. Summary Table of Principal Components Analysis on Tank

Figure 7. Graphical Presentation

For example, variables 5 (road speed) and 10 (power to
engine ratio) contribute heavily to the first principal
component, while variables 1 (weight) and 3 (width)
contribute most strongly to the second principal component.
Variables 2, 4, 6, 7, 8, and 9 are not as important. The
weights accorded each variable in the 10 factors (principal
components) are shown in Figure 8. The complete SPSS
program is listed in Appendix C.

D. DISCRIMINANT ANALYSIS 

The SPSS subprogram DISCRIMINANT was used to determine
the function or functions of the 10 variables that
best discriminate among the four clusters determined in
the cluster analysis.

The maximum number of discriminant functions to be
derived is either one less than the number of groups or equal
to the number of discriminating variables, whichever is
smaller. With four groups and 10 variables, at most three
functions can be derived here. This subprogram provides two
measures for judging the importance of discriminant
functions. One of these is the relative percentage of the
eigenvalue associated with the function, which is a measure
of the relative importance of the function. The sum of the
eigenvalues is a measure of the total variance existing in
the discriminating variables. Since discriminant functions
are derived in order of their importance, this process can
be stopped whenever the relative percentage is judged to be
too small. Of course, there is no fixed rule for deciding
what is too small. In this research, we arbitrarily selected
a significance level of 0.10. The output shown in Figure 9
suggests that we therefore consider only the first two
discriminant functions.
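
The two quantities in this paragraph are simple to compute once the eigenvalues are known. The sketch below uses hypothetical function names and, in the usage note, made-up eigenvalues; it does not reproduce the values in Figure 9.

```python
def max_functions(n_groups, n_vars):
    """Upper limit on the number of discriminant functions."""
    return min(n_groups - 1, n_vars)


def relative_percentages(eigenvalues):
    """Each eigenvalue as a percentage of the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]
```

For four groups and ten variables `max_functions(4, 10)` gives 3, and illustrative eigenvalues of 6.0, 3.0 and 1.0 would give relative percentages of 60, 30 and 10.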

The second measure for judging the importance of a
discriminant function is its associated canonical correlation.
The canonical correlation is a measure of association between
the single discriminant function and the set of (g-1) dummy
variables which define the g group memberships. It tells
us how closely the function and the group variable are
related, which is just another measure of the function's
ability to discriminate among the groups. From Figure 10,
the first two discriminant functions are each highly
correlated with the groups but the third has only a moderate
correlation.
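
The canonical correlation of a discriminant function is tied to its eigenvalue by the standard relation r = sqrt(eigenvalue / (1 + eigenvalue)). A one-function sketch, with a hypothetical name, makes the link explicit:

```python
import math


def canonical_correlation(eigenvalue):
    """Canonical correlation implied by a discriminant function's eigenvalue."""
    return math.sqrt(eigenvalue / (1.0 + eigenvalue))
```

A large eigenvalue therefore gives a correlation close to 1, while an eigenvalue near zero gives a correlation near zero, so the two measures rank the functions in the same order.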

The next criterion for eliminating discriminant
functions is to test for the statistical significance of
discriminating information not already accounted for by
the earlier functions. As each function is derived,
starting with no (zero) functions, Wilks' lambda is computed.
Lambda is an inverse measure of the discriminating power in
the original variables which has not yet been removed by the
discriminant functions: the larger lambda is, the less
information remains. Lambda can be transformed into
a chi-square statistic for an easy test of statistical
significance. In Figure 9, Wilks' lambda was .594 after
the first two functions had been derived. This corresponds
to a chi-square of 8.8476 with a probability level of .1823.
This means that a lambda of this magnitude or smaller has a
.1823 probability of occurring by chance in sampling even
if there were no further information to be accounted for by
a third function in the population. Since .1823 exceeds the
0.10 significance level selected above, a third function is
not statistically significant in this case.

Figure 9. Summary Table of Discriminant Analysis on Tank

Figure 10. Canonical Discriminant Function Coefficients
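
One common way to turn Wilks' lambda into a chi-square statistic is Bartlett's approximation, sketched below. Whether SPSS used exactly this transformation is an assumption; the text does not say. The function names are hypothetical, and the chi-square tail probability is computed from a series expansion so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, using the series for
    the regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


def bartlett_chi2(wilks_lambda, n_obs, n_vars, n_groups, n_derived):
    """Bartlett's chi-square approximation for the information remaining
    after n_derived discriminant functions have been removed."""
    stat = -(n_obs - 1 - (n_vars + n_groups) / 2.0) * math.log(wilks_lambda)
    df = (n_vars - n_derived) * (n_groups - n_derived - 1)
    return stat, df, chi2_sf(stat, df)
```

As a usage note, `bartlett_chi2(0.594, 24, 8, 4, 2)` gives a chi-square near 8.85 on 6 degrees of freedom with a tail probability near 0.18, close to the reported figures. The value of 8 discriminating variables in that call is a guess chosen because it comes close to the reported numbers; the text does not state how many variables entered the analysis.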

The standardized discriminant function coefficients,
corresponding to the values discussed in the previous
section, are used to compute the discriminant score
for a case (observation) in which the original discriminating 
variables are in standard form. The discriminant score is 
computed by multiplying each discriminating variable by its 
corresponding coefficient and adding together these products. 
There is a separate score for each observation on each 
function. The coefficients have been derived in such a way 
that the discriminant scores produced are in standard form. 

When the sign is ignored, each standardized discriminant
function coefficient represents the relative contribution of
its associated variable to that function. The sign merely
denotes whether the variable is making a positive or negative
contribution.
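
The score computation and the reading of coefficient magnitudes described above amount to the following sketch. The function names are hypothetical, and any variable names and coefficients passed to them would be illustrative rather than values from Figure 10.

```python
def discriminant_score(z_values, coefficients):
    """Score of one observation on one function: the sum of each
    standardized variable times its standardized coefficient."""
    return sum(c * z for c, z in zip(coefficients, z_values))


def rank_by_contribution(names, coefficients):
    """Variables ordered by absolute coefficient, largest contribution first."""
    pairs = zip(names, coefficients)
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)
```

Each observation gets one score per retained function, so with two functions every tank is reduced to a pair of scores that can be plotted against each other.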

A graphical presentation is shown in Figure 11 using
the first and the second canonical discriminant functions as
the axes. From this scatterplot, we can easily see that
Soviet tanks (labelled 1) are well distinguished from
all of the others using only the first two discriminant
functions. Also, all the lightweight tanks are clearly
separated from the others. The distinction between groups
2 and 3 is also clear, though they are not separated from each
other as much as from groups 1 and 4. The complete SPSS program
for the discriminant analysis is listed in Appendix D.

Figure 11. Scatter Plot for all Groups

VII. CONCLUSION



The multivariate analysis techniques of cluster analysis, 
principal components analysis and discriminant analysis are 
useful in real world problems for examining observations 
on each of several dimensions. Each of the techniques is
related mathematically to the others, and each complements
the others in explaining the data.

Computer software is readily available from many sources.
The software used in this thesis for hierarchical clustering,
principal components analysis, and discriminant analysis was
from the IMSL package and SPSS. For nonhierarchical
clustering, we used the FORTRAN program developed by McRae
(16). All of this software is readily available and
documented at the Naval Postgraduate School.